{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*************************************************\n",
    "*************************************************\n",
    "<h1 align=center>第五章 同级红酒关注群体发现用以辅助决策</h1> \n",
    "\n",
    "<a id=\"ref\"></a>\n",
    "## 目录表\n",
    "\n",
    "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
    "<li><a href=\"#ref0\">实验5-1 清洗微博数据</a></li>\n",
    "<li><a href=\"#ref1\">实验5-2 利用聚类发现同级红酒</a></li>\n",
    "<li><a href=\"#ref2\">实验5-3 利用pandaBI构建客户群体各维度分布图</a></li>\n",
    "\n",
    "    \n",
    "<br>\n",
    "<p></p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "************************************************************\n",
    "************************************************************\n",
    "<a id=\"ref0\"></a>\n",
    "<h2 align=center>实验5-1 清洗微博数据</h2>\n",
    "\n",
    "建议课时：30分钟\n",
    "\n",
    "## 一、实验目的\n",
    "\n",
    "1、了解pandas的常用函数\n",
    "\n",
    "2、了解数据清洗的基础方法\n",
    "\n",
    "## 二、实验环境\n",
    "Python3开发环境，第三方包有pandas\n",
    "\n",
    "## 三、实验原理\n",
    "本实验将利用pandas该python第三方包对爬取的数据进行清洗。pandas 是基于NumPy 的一种工具，该工具是为了解决数据分析任务而创建的。Pandas 纳入了大量库和一些标准的数据模型，提供了高效地操作大型数据集所需的工具。pandas提供了大量能使我们快速便捷地处理数据的函数和方法。你很快就会发现，它是使Python成为强大而高效的数据分析环境的重要因素之一。\n",
    "\n",
    "## 四、实验步骤\n",
    "本节处理的内容有：\n",
    "### 步骤1:读取数据\n",
    "读取数据的方式和第二章的读取数据一致，都是采用 json.loads() 函数读成字典类型。\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import csv\n",
    "import json\n",
    "csvfile = open('./data/微博红酒.csv', 'r')\n",
    "reader = csv.reader(csvfile)#读取到的数据是将每行数据当做列表返回的\n",
    "rows = []#用来存储解析后的没条数据\n",
    "for row in reader:\n",
    "    row_str = \",\".join(row)#row为list类型需转为str，该数据变为字典型字符串\n",
    "    row_dict = json.loads(row_str)\n",
    "        \n",
    "    #将每行数据中嵌套字典拆开存储到列表中\n",
    "    newdict = {}\n",
    "    for k in row_dict:\n",
    "        if type(row_dict[k]) == str:#将键值对赋给新字典\n",
    "            newdict[k] = row_dict[k]\n",
    "        elif type(row_dict[k]) == dict:#若存在嵌套字典，将该字典中的key和value作为属性和属性值\n",
    "            newdict.update(row_dict[k])\n",
    "    rows.append(newdict)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "24611"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "len(rows)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'id': '10099536052',\n",
       " 'url': 'https://weibo.com/1823411197/ghE0Li',\n",
       " 'keyword': '莱茵黑森 红酒',\n",
       " 'post_time': '2019-05-15 14:25:15',\n",
       " 'post_text': '今晚有红衫鱼  蚝仔烙 粉丝带子，如果把这支红酒换成加拿大冰酒或者99年莱茵黑森白葡就完美啦哈哈哈哈超有满足感的一餐，脂肪来吧来吧 ',\n",
       " 'post_source': '彩信',\n",
       " 'description': '还是很想你……dbx',\n",
       " 'gender': 'f',\n",
       " 'name': '舞夜eva',\n",
       " '昵称': '舞夜eva',\n",
       " '所在地': '广东 广州',\n",
       " '性别': '女',\n",
       " '生日': '11月4日',\n",
       " '博客': '',\n",
       " '个性域名': 'https://weibo.com/evasdream',\n",
       " '简介': '还是很想你……dbx',\n",
       " '注册时间': '2010-09-25',\n",
       " '公司': '中外|职位：设计部',\n",
       " '标签': '小蝎子|设计师|化妆师|大胃王|小肉丸'}"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "rows[2]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以得到24611条数据，"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(24611, 32)\n",
      "Index(['', 'MSN', 'QQ', 'description', 'gender', 'id', 'keyword', 'name',\n",
      "       'post_source', 'post_text', 'post_time', 'url', '个性域名', '中专技校', '公司',\n",
      "       '初中', '博客', '大学', '小学', '性别', '性取向', '感情状况', '所在地', '昵称', '标签', '注册时间',\n",
      "       '生日', '真实姓名', '简介', '血型', '邮箱', '高中'],\n",
      "      dtype='object')\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "df = pd.DataFrame(rows)#将字典型数组转为DataFrame形式\n",
    "print(df.shape)\n",
    "print(df.columns)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "处理搜索关键词，即keyword字段，处理红酒品牌的中英文表示，代码同上一章节类似，\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(24611, 6)\n",
      "Index(['keyword', 'post_time', '所在地', '性别', '生日', 'gender'], dtype='object')\n"
     ]
    }
   ],
   "source": [
    "df = df[[\"keyword\",\"post_time\",\"所在地\",\"性别\",\"生日\",\"gender\"]]\n",
    "print(df.shape)\n",
    "print(df.columns)\n",
    "#将DataFrame 写入到 csv 文件\n",
    "df.to_csv(\"wine_df_weibo.csv\",encoding=\"utf-8_sig\", index = False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "0               卓林/zonin\n",
      "1    怡园酒庄/grace vineyard\n",
      "2                   莱茵黑森\n",
      "3          长城/greatwall \n",
      "4                     奥兰\n",
      "5             华东/huadong\n",
      "6       黄尾袋鼠/yellow tail\n",
      "7       黄尾袋鼠/yellow tail\n",
      "8                     奥兰\n",
      "9                    安徒生\n",
      "Name: keyword, dtype: object\n"
     ]
    }
   ],
   "source": [
    "# 处理红酒名称（统一规范中文/英文的格式）\n",
    "#去keyword中“红酒”字符\n",
    "df['keyword']=df['keyword'].str.split('红酒').str[0]\n",
    "\n",
    "#整理红酒品牌\n",
    "brand = pd.read_csv(\"./data/红酒品牌.csv\",header=None,encoding=\"utf-8\")\n",
    "brands =[]\n",
    "for k1,k2 in zip(list(brand[0]),list(brand[1])):\n",
    "    if pd.isnull(k2):\n",
    "        brands.append(k1)\n",
    "    else:\n",
    "        brands.append(k1+\"/\"+k2)\n",
    "        \n",
    "# 品牌替换\n",
    "def modify_keywords(w, lists):\n",
    "    for b in lists:\n",
    "        if w.strip() in b:\n",
    "            return b\n",
    "    return w.strip()\n",
    "\n",
    "\n",
    "df['keyword'] = df['keyword'].apply(modify_keywords, lists =brands)\n",
    "df['keyword']=df['keyword'].str.lower()#转化为小写\n",
    "\n",
    "print(df[\"keyword\"].head(10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "接下来我们整理想要的属性，包含\"keyword\",\"post_time\",\"所在地\",\"性别\",\"生日\",\"gender\"这六种属性，保存临时处理文件，用以观察。\n",
    "## 步骤2:处理性别\n",
    "\n",
    "用 df.count 函数查看各维度的非缺失值，可见gender字段比性别可行度高，使用 df.groupby 验证gender中是否存在异常值\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>keyword</th>\n",
       "      <th>post_time</th>\n",
       "      <th>所在地</th>\n",
       "      <th>性别</th>\n",
       "      <th>生日</th>\n",
       "      <th>gender</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>卓林/zonin</td>\n",
       "      <td>2019-05-15 14:29:29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>m</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>怡园酒庄/grace vineyard</td>\n",
       "      <td>2019-05-15 14:25:30</td>\n",
       "      <td>香港 其他</td>\n",
       "      <td>女</td>\n",
       "      <td>NaN</td>\n",
       "      <td>f</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               keyword            post_time    所在地   性别   生日 gender\n",
       "0             卓林/zonin  2019-05-15 14:29:29    NaN  NaN  NaN      m\n",
       "1  怡园酒庄/grace vineyard  2019-05-15 14:25:30  香港 其他    女  NaN      f"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 处理性别\n",
    "# df.count()\n",
    "# df.groupby(\"gender\").size().sort_values(ascending=False)\n",
    "df.drop(\"性别\",axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤3:处理年龄\n",
    "据观察，出生字段有人写的是日期，有人写的是星座，有人写的是年月日因而采用提取年份前面的数字再与当前年份相减得到年龄值。\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 处理年龄\n",
    "# df.count()\n",
    "def age(x):\n",
    "    index = str(x).find('年')\n",
    "    if index == -1 :\n",
    "        return 9999\n",
    "    else:\n",
    "        res = 2019-int(str(x)[:index])\n",
    "        # 超过100岁，视为不合理\n",
    "        if res>100:\n",
    "            return 9999\n",
    "        else:\n",
    "            return 2019-int(str(x)[:index])\n",
    "\n",
    "df['age'] = df['生日'].map(age)\n",
    "df.drop(\"生日\",axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>keyword</th>\n",
       "      <th>post_time</th>\n",
       "      <th>所在地</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>卓林/zonin</td>\n",
       "      <td>2019-05-15 14:29:29</td>\n",
       "      <td>NaN</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>怡园酒庄/grace vineyard</td>\n",
       "      <td>2019-05-15 14:25:30</td>\n",
       "      <td>香港 其他</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>莱茵黑森</td>\n",
       "      <td>2019-05-15 14:25:15</td>\n",
       "      <td>广东 广州</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>长城/greatwall</td>\n",
       "      <td>2019-05-15 14:24:20</td>\n",
       "      <td>上海 浦东新区</td>\n",
       "      <td>m</td>\n",
       "      <td>34</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>奥兰</td>\n",
       "      <td>2019-05-15 14:15:26</td>\n",
       "      <td>北京</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>华东/huadong</td>\n",
       "      <td>2019-05-15 14:27:43</td>\n",
       "      <td>山东 青岛</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:25:05</td>\n",
       "      <td>辽宁 大连</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:25:05</td>\n",
       "      <td>上海</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>奥兰</td>\n",
       "      <td>2019-05-15 14:15:26</td>\n",
       "      <td>广东 深圳</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>安徒生</td>\n",
       "      <td>2019-05-15 14:27:42</td>\n",
       "      <td>NaN</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>10</th>\n",
       "      <td>安徒生</td>\n",
       "      <td>2019-05-15 14:27:42</td>\n",
       "      <td>湖南 长沙</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>11</th>\n",
       "      <td>木桐嘉棣/mouton cadet</td>\n",
       "      <td>2019-05-15 14:25:39</td>\n",
       "      <td>北京 朝阳区</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>12</th>\n",
       "      <td>加州乐事</td>\n",
       "      <td>2019-05-15 14:24:50</td>\n",
       "      <td>其他</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>13</th>\n",
       "      <td>安徒生</td>\n",
       "      <td>2019-05-15 14:27:42</td>\n",
       "      <td>上海 黄浦区</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>14</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:13:53</td>\n",
       "      <td>广东 广州</td>\n",
       "      <td>f</td>\n",
       "      <td>30</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>15</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:25:05</td>\n",
       "      <td>NaN</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>16</th>\n",
       "      <td>尼雅/niya</td>\n",
       "      <td>2019-05-15 14:24:58</td>\n",
       "      <td>北京 朝阳区</td>\n",
       "      <td>m</td>\n",
       "      <td>40</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>17</th>\n",
       "      <td>蒙特斯/montes</td>\n",
       "      <td>2019-05-15 14:24:31</td>\n",
       "      <td>四川 成都</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>18</th>\n",
       "      <td>蒙特斯/montes</td>\n",
       "      <td>2019-05-15 14:14:11</td>\n",
       "      <td>北京</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>19</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:25:05</td>\n",
       "      <td>上海 静安区</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                keyword            post_time      所在地 gender   age\n",
       "0              卓林/zonin  2019-05-15 14:29:29      NaN      m  9999\n",
       "1   怡园酒庄/grace vineyard  2019-05-15 14:25:30    香港 其他      f  9999\n",
       "2                  莱茵黑森  2019-05-15 14:25:15    广东 广州      f  9999\n",
       "3         长城/greatwall   2019-05-15 14:24:20  上海 浦东新区      m    34\n",
       "4                    奥兰  2019-05-15 14:15:26       北京      f  9999\n",
       "5            华东/huadong  2019-05-15 14:27:43    山东 青岛      f  9999\n",
       "6      黄尾袋鼠/yellow tail  2019-05-15 14:25:05    辽宁 大连      m  9999\n",
       "7      黄尾袋鼠/yellow tail  2019-05-15 14:25:05       上海      m  9999\n",
       "8                    奥兰  2019-05-15 14:15:26    广东 深圳      f  9999\n",
       "9                   安徒生  2019-05-15 14:27:42      NaN      m  9999\n",
       "10                  安徒生  2019-05-15 14:27:42    湖南 长沙      m  9999\n",
       "11    木桐嘉棣/mouton cadet  2019-05-15 14:25:39   北京 朝阳区      f  9999\n",
       "12                 加州乐事  2019-05-15 14:24:50       其他      m  9999\n",
       "13                  安徒生  2019-05-15 14:27:42   上海 黄浦区      f  9999\n",
       "14     黄尾袋鼠/yellow tail  2019-05-15 14:13:53    广东 广州      f    30\n",
       "15     黄尾袋鼠/yellow tail  2019-05-15 14:25:05      NaN      m  9999\n",
       "16             尼雅/niya   2019-05-15 14:24:58   北京 朝阳区      m    40\n",
       "17          蒙特斯/montes   2019-05-15 14:24:31    四川 成都      m  9999\n",
       "18          蒙特斯/montes   2019-05-15 14:14:11       北京      f  9999\n",
       "19     黄尾袋鼠/yellow tail  2019-05-15 14:25:05   上海 静安区      f  9999"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(20)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤4: 处理地区\n",
    "据观察，所在地的取值有三种，分别是其他，大地区，大地区加小地区\n",
    "处理方法：\n",
    "\n",
    "替代 缺失值 为 其他\n",
    "\n",
    "延申出两维数据，分别表示为大地区和小地区，没有小地区则表示为其他\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 处理地区\n",
    "df['所在地'].fillna(\"其他\",inplace=True)\n",
    "\n",
    "def place(x):\n",
    "    words = str(x).split(\" \")\n",
    "    if len(words) > 1:\n",
    "        return words[0].strip(), words[1].strip()\n",
    "    else:\n",
    "        return words[0],\"其他\"\n",
    "    \n",
    "df[\"所在地\"] = df[\"所在地\"].map(place)\n",
    "df['country'] = df[\"所在地\"].apply(lambda x : x[0])\n",
    "df['region'] = df['所在地'].apply(lambda x : x[1])\n",
    "df.drop(\"所在地\",axis = 1,inplace = True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "结果如下所示："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>keyword</th>\n",
       "      <th>post_time</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>卓林/zonin</td>\n",
       "      <td>2019-05-15 14:29:29</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "      <td>其他</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>怡园酒庄/grace vineyard</td>\n",
       "      <td>2019-05-15 14:25:30</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "      <td>香港</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>莱茵黑森</td>\n",
       "      <td>2019-05-15 14:25:15</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "      <td>广东</td>\n",
       "      <td>广州</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>长城/greatwall</td>\n",
       "      <td>2019-05-15 14:24:20</td>\n",
       "      <td>m</td>\n",
       "      <td>34</td>\n",
       "      <td>上海</td>\n",
       "      <td>浦东新区</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>奥兰</td>\n",
       "      <td>2019-05-15 14:15:26</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "      <td>北京</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>华东/huadong</td>\n",
       "      <td>2019-05-15 14:27:43</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "      <td>山东</td>\n",
       "      <td>青岛</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:25:05</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "      <td>辽宁</td>\n",
       "      <td>大连</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>2019-05-15 14:25:05</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "      <td>上海</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>奥兰</td>\n",
       "      <td>2019-05-15 14:15:26</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "      <td>广东</td>\n",
       "      <td>深圳</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>安徒生</td>\n",
       "      <td>2019-05-15 14:27:42</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "      <td>其他</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               keyword            post_time gender   age country region\n",
       "0             卓林/zonin  2019-05-15 14:29:29      m  9999      其他     其他\n",
       "1  怡园酒庄/grace vineyard  2019-05-15 14:25:30      f  9999      香港     其他\n",
       "2                 莱茵黑森  2019-05-15 14:25:15      f  9999      广东     广州\n",
       "3        长城/greatwall   2019-05-15 14:24:20      m    34      上海   浦东新区\n",
       "4                   奥兰  2019-05-15 14:15:26      f  9999      北京     其他\n",
       "5           华东/huadong  2019-05-15 14:27:43      f  9999      山东     青岛\n",
       "6     黄尾袋鼠/yellow tail  2019-05-15 14:25:05      m  9999      辽宁     大连\n",
       "7     黄尾袋鼠/yellow tail  2019-05-15 14:25:05      m  9999      上海     其他\n",
       "8                   奥兰  2019-05-15 14:15:26      f  9999      广东     深圳\n",
       "9                  安徒生  2019-05-15 14:27:42      m  9999      其他     其他"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤5:处理发布时间\n",
    "发布时间在一定程度上可以反映不同时间段对红酒品牌的关注度，可能在某一营销活动或文章之后，关注度有所提升，则可以认为这种营销活动和文章是有一定积极因素的。\n",
    "观察post_time字段，发现均为2019年5月15日发布，由于数据量的局限性，此字段予以删除。学生们可以利用爬虫脚本爬取更多数据量，在此维度上做一些统计方法以辅助营销决策。\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 处理发布日期\n",
    "df.drop(\"post_time\",axis=1,inplace=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>keyword</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>country</th>\n",
       "      <th>region</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>卓林/zonin</td>\n",
       "      <td>m</td>\n",
       "      <td>9999</td>\n",
       "      <td>其他</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>怡园酒庄/grace vineyard</td>\n",
       "      <td>f</td>\n",
       "      <td>9999</td>\n",
       "      <td>香港</td>\n",
       "      <td>其他</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               keyword gender   age country region\n",
       "0             卓林/zonin      m  9999      其他     其他\n",
       "1  怡园酒庄/grace vineyard      f  9999      香港     其他"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.head(2)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "综上，微博的数据处理完毕，5.2节将结合第三章的处理结果进行同级红酒的发现。\n",
    "\n",
    "## 五、实验总结\n",
    "本节对爬取的微博用户数据信息进行处理，处理的维度有性别，年龄，地区，发布时间，下节将对红酒进行评级，挖掘同类红酒的关注者的共同特征。\n",
    "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\" >\n",
    "<li><a href=\"#ref\" >返回目录表</a></li>   \n",
    "<br>\n",
    "<p></p>\n",
    "</div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "*****************************************************************\n",
    "************************************************************\n",
    "<a id=\"ref1\"></a>\n",
    "<h2 align=center>实验5-2 利用聚类发现同级红酒</h2>\n",
    "\n",
    "建议课时：20分钟\n",
    "## 一、实验目的\n",
    "1、了解pandas的常用函数\n",
    "\n",
    "2、了解聚类方法的处理流程\n",
    "\n",
    "## 二、实验环境\n",
    "Python3开发环境，第三方包有pandas，sklearn\n",
    "\n",
    "## 三、实验原理\n",
    "本节将利用聚类思想对红酒进行归类。将物理或抽象对象的集合分成由类似的对象组成的多个类的过程被称为聚类。由聚类所生成的簇是一组数据对象的集合，这些对象与同一个簇中的对象彼此相似，与其他簇中的对象相异。聚类与分类的不同在于，聚类所要求划分的类是未知的。\n",
    "\n",
    "## 四、实验步骤\n",
    "在之前观察数据时，可以发现很多离散数据都含有6-8个不同数据取值，所以此处我们将选择聚类中的k值为6，查看聚类效果。\n",
    "\n",
    "### 步骤1:读取之前为微博数据分析准备的数据\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "winedf = pd.read_csv(\"./prepare_for_weibo.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤2:one-hot编码\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "keyword                     object\n",
       "price                      float64\n",
       "产区                          object\n",
       "原产地                         object\n",
       "口感                          object\n",
       "国产/进口                       object\n",
       "特性                          object\n",
       "甜度                          object\n",
       "颜色                          object\n",
       "冰酒/贵腐/甜酒                     int64\n",
       "白葡萄酒                         int64\n",
       "果味葡萄酒                        int64\n",
       "桃红葡萄酒                        int64\n",
       "红葡萄酒                         int64\n",
       "起泡酒/香槟                       int64\n",
       "马尔贝克（Malbec）                 int64\n",
       "其它                           int64\n",
       "梅洛（Merlot）                   int64\n",
       "仙粉黛（Zinfandel）               int64\n",
       "赤霞珠（Cabernet Sauvignon）      int64\n",
       "西拉/设拉子（Syrah/Shiraz）         int64\n",
       "雷司令（Riesling）                int64\n",
       "桑娇维塞（Sangiovese）             int64\n",
       "黑皮诺（Pinot Noir）              int64\n",
       "长相思（Sauvignon Blanc）         int64\n",
       "内比奥罗（Nebbiolo）               int64\n",
       "佳美（Gamay）                    int64\n",
       "品丽珠（Cabernet Franc）          int64\n",
       "匹诺塔吉（Pinotage）               int64\n",
       "霞多丽（Chardonnay）              int64\n",
       "蛇龙珠（Cabernet Gernischt）      int64\n",
       "佳美娜（Carmenere）               int64\n",
       "year                         int64\n",
       "alcohol                    float64\n",
       "dtype: object"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "winedf.dtypes"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "由于kmeans接受数值型数据，所以需要采用one-hot编码对object类别的数据进行转换"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [],
   "source": [
    "resdf = pd.get_dummies(winedf) #非数值型都转化为one-hot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "步骤3:聚类分析\n",
    "    代码使用很简单，定好聚类数目即可，其他参数可见官方文档，也可在jupyter编辑器中利用？来查询源码和用法。\n",
    "    <div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.cluster import KMeans\n",
    "\n",
    "kmodel = KMeans(n_clusters = 6,n_jobs=5)\n",
    "kmodel.fit(resdf)        #训练模型\n",
    "rs = kmodel.predict(resdf) #类别索引\n",
    "winedf['label'] = rs"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤3:同级红酒发现\n",
    "看一下每个类别中数目较多的红酒类别：\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr th {\n",
       "        text-align: left;\n",
       "    }\n",
       "\n",
       "    .dataframe thead tr:last-of-type th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th colspan=\"3\" halign=\"left\">price</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th></th>\n",
       "      <th>min</th>\n",
       "      <th>max</th>\n",
       "      <th>count</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>label</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>9.83</td>\n",
       "      <td>3399.0</td>\n",
       "      <td>3205</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4.83</td>\n",
       "      <td>50000.0</td>\n",
       "      <td>4263</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3.32</td>\n",
       "      <td>23040.0</td>\n",
       "      <td>3112</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>16.00</td>\n",
       "      <td>8800.0</td>\n",
       "      <td>206</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>78000.00</td>\n",
       "      <td>79560.0</td>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>3499.00</td>\n",
       "      <td>29800.0</td>\n",
       "      <td>163</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "          price               \n",
       "            min      max count\n",
       "label                         \n",
       "0          9.83   3399.0  3205\n",
       "1          4.83  50000.0  4263\n",
       "2          3.32  23040.0  3112\n",
       "3         16.00   8800.0   206\n",
       "4      78000.00  79560.0     2\n",
       "5       3499.00  29800.0   163"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "winedf.groupby(\"label\").agg({'price':['min','max','count']})"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 专家分析：\n",
    "每个类中的价格最小值和最大值，可见分类界限没那么明确，造成这样的原因主要是两部分：一是聚类数目没有选择好，二是数据的准确度，之前的红酒价格是根据数据规范算出来的，但有可能提取的瓶数是错的，导致数据是不正确的，所以聚类上对品牌会造成影响。\n",
    "但我们可以查看一下每个类中数目占比就多的红酒品牌："
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {},
   "outputs": [],
   "source": [
    "classA = winedf[winedf.label==0].groupby([\"keyword\"]).size().sort_values(ascending=False).reset_index()\n",
    "classA.columns=[\"keyword\",\"A\"]\n",
    "classB = winedf[winedf.label==1].groupby([\"keyword\"]).size().sort_values(ascending=False).reset_index()\n",
    "classB.columns=[\"keyword\",\"B\"]\n",
    "classC = winedf[winedf.label==2].groupby([\"keyword\"]).size().sort_values(ascending=False).reset_index()\n",
    "classC.columns=[\"keyword\",\"C\"]\n",
    "classD = winedf[winedf.label==3].groupby([\"keyword\"]).size().sort_values(ascending=False).reset_index()\n",
    "classD.columns=[\"keyword\",\"D\"]\n",
    "classE = winedf[winedf.label==4].groupby([\"keyword\"]).size().sort_values(ascending=False).reset_index()\n",
    "classE.columns=[\"keyword\",\"E\"]\n",
    "classF = winedf[winedf.label==5].groupby([\"keyword\"]).size().sort_values(ascending=False).reset_index()\n",
    "classF.columns=[\"keyword\",\"F\"]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(120, 2)\n",
      "(172, 2)\n",
      "(136, 2)\n",
      "(56, 2)\n",
      "(1, 2)\n",
      "(11, 2)\n"
     ]
    }
   ],
   "source": [
    "print(classA.shape)\n",
    "print(classB.shape)\n",
    "print(classC.shape)\n",
    "print(classD.shape)\n",
    "print(classE.shape)\n",
    "print(classF.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>keyword</th>\n",
       "      <th>C</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>长城/greatwall</td>\n",
       "      <td>606</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>张裕/changyu</td>\n",
       "      <td>219</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>通化/tonghua</td>\n",
       "      <td>180</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>王朝/dynasty</td>\n",
       "      <td>102</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>拉菲/lafite</td>\n",
       "      <td>98</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>黄尾袋鼠/yellow tail</td>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>杰卡斯/jacob’s creek</td>\n",
       "      <td>97</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>7</th>\n",
       "      <td>罗曼尼康帝庄园</td>\n",
       "      <td>96</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>8</th>\n",
       "      <td>高斯达</td>\n",
       "      <td>89</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>9</th>\n",
       "      <td>奔富/penfolds</td>\n",
       "      <td>88</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             keyword    C\n",
       "0      长城/greatwall   606\n",
       "1        张裕/changyu   219\n",
       "2         通化/tonghua  180\n",
       "3        王朝/dynasty   102\n",
       "4          拉菲/lafite   98\n",
       "5   黄尾袋鼠/yellow tail   97\n",
       "6  杰卡斯/jacob’s creek   97\n",
       "7            罗曼尼康帝庄园   96\n",
       "8                高斯达   89\n",
       "9        奔富/penfolds   88"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "classC[:10]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤4:品牌分类\n",
    "从结果上看，同样的红酒品牌会在多个类中出现，可以由数量上判断该品牌属于哪个类别，由此整理得到：\n",
    "\n",
    "A\t'长城/greatwall ', '张裕/changyu ', '王朝/dynasty ', '高斯达', '莫高/mogao ', '智象/chilephant', '拉梦堡/lamengbao', '芙华/la fiole', '玛茜/rochemazet', '尼雅/niya ', '天鹅庄', '华东/huadong', '嘉伦多', '罗莎庄园', '蒙大菲/robert mondavi'  等等\n",
    "\n",
    "B\t'拉菲/lafite', '奔富/penfolds', '纷赋/wolfblass', '黄尾袋鼠/yellow tail', '干露/concha y toro ', '蒙特斯/montes ', '贝灵哲/beringer', '杰卡斯/jacob’s creek', '圣丽塔', '威赛帝斯', '卡思黛乐/castel', '璞立/beaulieu vineyard', '也买酒/yesmywine', '麦格根/mcguigan', '玛歌酒庄/chateau margaux', '加州乐事', '音符/awjs'  等等\n",
    "\n",
    "C\t'罗曼尼康帝庄园', '通化/tonghua', '波尔多', '路易拉菲/louis lafon ', '贺兰山', '名庄靓年', '拉蒙', '香奈/j.p.chenet ', '木桐', '丰收'  等等\n",
    "\n",
    "D\t'罗马假日', '君顶', '优尼特/riunite', '阿维娃/aviva', '蓝海之鲸/mr.sparkling'\n",
    "\n",
    "F\t'木桐古堡/ch. mouton rothschild', '拉图酒庄', '作品一号/opus one'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(222, 7)\n"
     ]
    }
   ],
   "source": [
    "level = pd.DataFrame(list(set(winedf['keyword'])),columns=[\"keyword\"])\n",
    "level = pd.merge(level,classA,how='left',on=[\"keyword\"])\n",
    "level = pd.merge(level,classB,how='left',on=[\"keyword\"])\n",
    "level = pd.merge(level,classC,how='left',on=[\"keyword\"])\n",
    "level = pd.merge(level,classD,how='left',on=[\"keyword\"])\n",
    "level = pd.merge(level,classF,how='left',on=[\"keyword\"])\n",
    "level = level.fillna(0)\n",
    "\n",
    "level['max_value']=level.max(axis=1)\n",
    "level=level[level.max_value>0]\n",
    "print(level.shape)\n",
    "def appendlevel(sr):\n",
    "    levels=list('ABCDF')\n",
    "    for i in levels:\n",
    "        if sr[i] == sr['max_value']:\n",
    "            return i\n",
    "level['level'] = level.apply(lambda x:appendlevel(x),axis=1)#每一行apply"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "level\n",
       "A     54\n",
       "B    109\n",
       "C     51\n",
       "D      5\n",
       "F      3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "level.groupby('level').size()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['木桐古堡/ch. mouton rothschild', '拉图酒庄', '作品一号/opus one']"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "tmp = level[level.level=='F'].sort_values(\"max_value\",ascending=False)\n",
    "list(tmp[tmp.max_value>=0]['keyword'])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤5:给微博数据加标签\n",
    "把清洗微博数据的结果和同级红酒的处理结果做合并\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "result = pd.merge(df,level[[\"keyword\",\"level\"]],how=\"left\",on=\"keyword\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "(22524, 6)\n"
     ]
    }
   ],
   "source": [
    "# 删除level为空的数据\n",
    "result.dropna(axis='index',inplace=True)\n",
    "print(result.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 步骤6:保存结果\n",
    "保存为 level.csv 文件，下一节采用pandaBI的大屏展示显示相关数据\n",
    "<div class=\"alert alert-success alertsuccess\" style=\"margin-top: 10px\">\n",
    "代码示例：\n",
    "</div>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [],
   "source": [
    "result.to_csv(\"level.csv\",encoding=\"utf-8_sig\",index = False)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 五、实验总结\n",
    "本节通过聚类分析得到同级红酒列表，并生成红酒打标文件 level.csv，下一节将与第三章的数据进行结合实现关注人特征发现的应用。\n",
    "<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\" >\n",
    "<li><a href=\"#ref\" >返回目录表</a></li>   \n",
    "<br>\n",
    "<p></p>\n",
    "</div>"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
